-
Notifications
You must be signed in to change notification settings - Fork 14.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a check for trailing slash in webserver base_url #31833
Conversation
Signed-off-by: Hussein Awala <[email protected]>
airflow/models/taskinstance.py
Outdated
@@ -758,7 +758,7 @@ def generate_command( | |||
def log_url(self) -> str: | |||
"""Log URL for TaskInstance.""" | |||
iso = quote(self.execution_date.isoformat()) | |||
base_url = conf.get_mandatory_value("webserver", "BASE_URL") | |||
base_url = conf.get_mandatory_value("webserver", "BASE_URL").rstrip("/") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like this could lead to some errors in the future.
It's hard for someone to guess that retrieving the base_url
from the config is a special case and needs to strip trailing '/'. (I can easily imagine someone doing conf.get_mandatory_value("webserver", "BASE_URL")
to access the value)
Maybe we can put that in the config accessor ? (So we don't need to bother, the equivalent of @validates/@pre_process
for marshmallow field, that does some preprocesssing on the value before storing it)
Otherwise looking good
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will check how can we do that for a specific config
Two thoughts
|
it's possible for the changes in
I will try this one, and I don't think it's a braking change since it doesn't work with a trailing slash anyway |
Signed-off-by: Hussein Awala <[email protected]>
Signed-off-by: Hussein Awala <[email protected]>
@pierrejeambrun @uranusjr could you recheck it please? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bugfix but could break some user configs by preventing webservers from starting.
Should we go with a warning first before actually refusing such configuration ?
(No strong opinion, not sure how we should handle this actually)
@pierrejeambrun As I mentioned in my previous comment:
Therefore, if users utilize a base_url with a trailing slash, they will encounter an issue, and comprehending the cause might not be straightforward. IMO, raising an exception is a secure approach since it does not break any existing functionality, but it just explain the problem reason. WDYT? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I missed that in your comment.
I did a few tests, looking good
Co-authored-by: Tzu-ping Chung <[email protected]>
* Remove right trailing / from webserver base_url Signed-off-by: Hussein Awala <[email protected]> * use url join instead of removing trailing slash Signed-off-by: Hussein Awala <[email protected]> * raise an exception when base_url contains a trailing slash Signed-off-by: Hussein Awala <[email protected]> * Update airflow/www/extensions/init_wsgi_middlewares.py Co-authored-by: Tzu-ping Chung <[email protected]> --------- Signed-off-by: Hussein Awala <[email protected]> Co-authored-by: Tzu-ping Chung <[email protected]> (cherry picked from commit fe4a6c8)
@@ -37,8 +38,11 @@ def _root_app(env: WSGIEnvironment, resp: StartResponse) -> Iterable[bytes]: | |||
|
|||
def init_wsgi_middleware(flask_app: Flask) -> None: | |||
"""Handle X-Forwarded-* headers and base_url support.""" | |||
webserver_base_url = conf.get_mandatory_value("webserver", "BASE_URL", fallback="") | |||
if webserver_base_url.endswith("/"): | |||
raise AirflowConfigException("webserver.base_url conf cannot have a trailing slash.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am wondering here that why can the base URL not have a trailing /
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is due to how web servers rewrite URLs when you deploy a webapp under a prefix. Say you deploy under https://mydomain/airflow
. When you visit https://mydomain/airflow/home
in the browser (as an example; any client applies), the proxy server (e.g. NGINX) would strip the prefix, and forward the rest of the URL to the app, so in Python code (e.g. Airflow) we can just handle /home
without needing to consider what prefix the entirety of Airflow is deployed under.
Now you may say, hey, why can’t either the proxy server, the gateway protocol (WSGI in Python’s case), or the web framework (Flask for Airflow’s case) be smarter and just strip or add the slash in between as appropriate? And you would be right! Some of them actually can do this (NGINX has merge_slashes
, for example), but not everyone does since the slash is technically significant and does change the URL’s meaning (even if that meaning can be nonsensical in the prefix), and that extra check is not free. So it’s better to avoid fighting the tools and just stick to the technically correct configuration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay, I am coming from here on how the python urllib.parse.urljoin
handles joining base URL and relative URLs.
Wouldn't https://my.astronomer.run/path
qualify as a valid base URL? But looks like the urljoin
just ignores the path when it does not end with a trailing /
. If it was https://my.astronomer.run/path/
, urljoin
does not strip the path and makes the join as expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can handle more cases if users need it. This PR was a quick fix to prevent circular redirections.
I’m sure we can make this work with a trailing slash but requires a bit more work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Sorry, forgot to reply.)
Python’s urljoin
implements the logic in <a href="...">
tags; I don’t recall the correct term for this, but it is the URL resolution logic when you click on links in a browser. URLs have two kinds, folder and file (not the right terms, but easier to understand). The URL would not have a trailing slash if you are in a file view; the trailing slash indicates a folder view. When you’re in a file, a relative path like foo
indicates another file in the same folder. When you’re in a folder view, however, foo
means the foo entry in the folder. This means the resolution logic would change depending on whether the URL has a trailing slash or not.
Note that Python’s local path joining logic (both pathlib and os.path) is different and does not consider the trailing slash; it instead matches the common logic used to join format in shell scripts.
But the prefix is an entirely different thing, and is intended to only be joined with simple string concatenation and skips all the folder/file logic because it is not semantically possible to support combining the prefix with absolute paths and relative paths. The only proper way to prepend a prefix is prefix + path
(or other string concatenation methods such as f"{prefix}{path}"
, of course).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
understood, thank you @uranusjr for the detailed explanation and propsoal.
It is observed that urljoin is not yielding expected results for the task instance's log_url which needs to be a concatenation of the webserver base_url and specified relative url. The current usage of urljoin does not seem to be the right way to achieve this based on what urljoin is meant for and how it works. So, we use simple string concatenation to yield the desired result. More context in the comment apache#31833 (comment) closes: apache#32996
It is observed that urljoin is not yielding expected results for the task instance's log_url which needs to be a concatenation of the webserver base_url and specified relative url. The current usage of urljoin does not seem to be the right way to achieve this based on what urljoin is meant for and how it works. So, we use simple string concatenation to yield the desired result. More context in the comment #31833 (comment) closes: #32996
It is observed that urljoin is not yielding expected results for the task instance's log_url which needs to be a concatenation of the webserver base_url and specified relative url. The current usage of urljoin does not seem to be the right way to achieve this based on what urljoin is meant for and how it works. So, we use simple string concatenation to yield the desired result. More context in the comment #31833 (comment) closes: #32996 (cherry picked from commit baa1bc0)
It is observed that urljoin is not yielding expected results for the task instance's log_url which needs to be a concatenation of the webserver base_url and specified relative url. The current usage of urljoin does not seem to be the right way to achieve this based on what urljoin is meant for and how it works. So, we use simple string concatenation to yield the desired result. More context in the comment apache/airflow#31833 (comment) closes: #32996 (cherry picked from commit baa1bc0438baa05d358b236eec3c343438d8d53c) GitOrigin-RevId: 868d1389461822cf5d54faaff3ce19913fc7f08e
It is observed that urljoin is not yielding expected results for the task instance's log_url which needs to be a concatenation of the webserver base_url and specified relative url. The current usage of urljoin does not seem to be the right way to achieve this based on what urljoin is meant for and how it works. So, we use simple string concatenation to yield the desired result. More context in the comment apache/airflow#31833 (comment) closes: #32996 GitOrigin-RevId: baa1bc0438baa05d358b236eec3c343438d8d53c
It is observed that urljoin is not yielding expected results for the task instance's log_url which needs to be a concatenation of the webserver base_url and specified relative url. The current usage of urljoin does not seem to be the right way to achieve this based on what urljoin is meant for and how it works. So, we use simple string concatenation to yield the desired result. More context in the comment apache/airflow#31833 (comment) closes: #32996 GitOrigin-RevId: baa1bc0438baa05d358b236eec3c343438d8d53c
It is observed that urljoin is not yielding expected results for the task instance's log_url which needs to be a concatenation of the webserver base_url and specified relative url. The current usage of urljoin does not seem to be the right way to achieve this based on what urljoin is meant for and how it works. So, we use simple string concatenation to yield the desired result. More context in the comment apache/airflow#31833 (comment) closes: #32996 GitOrigin-RevId: baa1bc0438baa05d358b236eec3c343438d8d53c
closes: #31726
^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named
{pr_number}.significant.rst
or{issue_number}.significant.rst
, in newsfragments.